Extending Tables with Data from over a Million Websites

Authors

  • Oliver Lehmberg
  • Dominique Ritze
  • Petar Ristoski
  • Kai Eckert
  • Heiko Paulheim
  • Christian Bizer
Abstract

This Big Data Track submission demonstrates how the BTC 2014 dataset, Microdata annotations from thousands of websites, and millions of HTML tables are used to extend local tables with additional columns. Table extension is a useful operation in a wide range of application scenarios: Imagine you are an analyst with a local table describing companies and you want to extend this table with the headquarters of each company. Or imagine you are a film enthusiast and want to extend a table describing films with attributes such as the director, genre, and release date of each film. The Mannheim Search Joins Engine automatically performs such table extension operations based on a large data corpus gathered from over a million websites that publish structured data in various formats. Given a local table, the Mannheim Search Joins Engine searches the corpus for additional data describing the entities of the input table. The discovered data are then joined with the local table and their content is consolidated using schema matching and data fusion methods. As a result, the user is presented with an extended table and given the opportunity to examine the provenance of the added data. Our experiments show that the Mannheim Search Joins Engine achieves a coverage close to 100% and a precision of around 90% within different application scenarios.

1 Application Example

Assume a marketing manager who wants to classify the customers of a company according to different properties of the countries in which the customers are located, in order to select those that should be targeted by a marketing campaign. While the data about the customers can be found in the company's internal data sources, further background information about the customers' countries cannot. Relevant data about countries could, for instance, include their population, GDP, or human development index.

Today, the manager needs to manually search and integrate data about each country using search engines such as Google, access a small set of online databases he knows about, or copy and paste values from Wikipedia. Manually searching for data is cumbersome, and the manager will likely miss a large fraction of the relevant data sources that are available on the Web. The Mannheim Search Joins Engine (MSJ Engine) supports the manager in reaching his goal by automating the data search and data integration tasks, leaving him free to concentrate on his core task.
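The search, join, and consolidation steps described above can be illustrated with a minimal Python sketch. It is not the MSJ Engine's implementation: the function names (`normalize`, `match_column`, `extend_table`), the in-memory corpus layout, the alias-based column matching, and the majority-vote fusion are all assumptions chosen to keep the example small. A real system would need far more robust entity matching, schema matching, and data fusion.

```python
from collections import Counter


def normalize(text):
    """Lower-case and collapse whitespace so labels can be compared."""
    return " ".join(str(text).lower().split())


def match_column(row, aliases):
    """Toy schema matching: return the first column whose label is a known alias."""
    for column in row:
        if normalize(column) in aliases:
            return column
    return None


def extend_table(local_rows, key, corpus_tables, aliases, new_column):
    """Extend each local row with new_column using values found in the corpus.

    corpus_tables is a hypothetical structure: a list of dicts with the
    fields "source", "key" (name of the entity column), and "rows".
    """
    # Search: collect candidate values per entity of the local table.
    candidates = {normalize(row[key]): [] for row in local_rows}
    for table in corpus_tables:
        for row in table["rows"]:
            entity = normalize(row.get(table["key"], ""))
            column = match_column(row, aliases)
            if entity in candidates and column is not None:
                candidates[entity].append((row[column], table["source"]))

    # Join and fuse: pick the most frequent value and keep its provenance.
    extended = []
    for row in local_rows:
        found = candidates[normalize(row[key])]
        new_row = dict(row)
        if found:
            winner = Counter(value for value, _ in found).most_common(1)[0][0]
            new_row[new_column] = winner
            new_row["provenance"] = [src for value, src in found if value == winner]
        else:
            new_row[new_column] = None
            new_row["provenance"] = []
        extended.append(new_row)
    return extended


# Illustrative data only; the sources and values are made up.
local_rows = [{"country": "Germany"}, {"country": "France"}, {"country": "Atlantis"}]
corpus = [
    {"source": "a.example", "key": "name",
     "rows": [{"name": "germany", "Population": "83M"},
              {"name": "France", "Population": "68M"}]},
    {"source": "b.example", "key": "Country",
     "rows": [{"Country": "Germany ", "Population": "83M"}]},
    {"source": "c.example", "key": "country",
     "rows": [{"country": "Germany", "inhabitants": "80M"}]},
]
extended = extend_table(local_rows, "country", corpus,
                        {"population", "inhabitants"}, "population")
```

In this sketch, the `provenance` field records which sources contributed the chosen value, mirroring the idea that the user can inspect where added data came from. Entities with no match in the corpus keep an empty value rather than a guessed one.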

Similar resources

The Mannheim Search Join Engine

A Search Join is a join operation which extends a user-provided table with additional attributes based on a large corpus of heterogeneous data originating from the Web or corporate intranets. Search Joins are useful within a wide range of application scenarios: Imagine you are an analyst having a local table describing companies and you want to extend this table with attributes containing the h...

Tuberculosis: Past, Present and Future

Background: Tuberculosis (TB) is the second-most common cause of death from infectious disease (after those due to HIV/AIDS). Roughly one-third of the world's population has been infected with M. tuberculosis, with new infections occurring in about 1% of the population each year. People with active TB can infect 10-15 other people through close contact over the course of a year. Materials and ...

Exposing the Hidden Web: An Analysis of Third-Party HTTP Requests on One Million Websites

This article provides a quantitative analysis of privacy-compromising mechanisms on one million popular websites. Findings indicate that nearly nine in ten websites leak user data to parties of which the user is likely unaware; over six in ten websites spawn third-party cookies; and over eight in ten websites load JavaScript code from external parties onto users' computers. Sites which leak ...

Database Support for Path Query Functions

Extending relational database functionality to include data mining primitives is one step towards the greater goal of more closely integrated database and mining systems. This paper describes one such extension, where database technology is used to implement path queries over a graph view of relational data. Partial-path information is pre-computed and stored in a compressed binary format in an...

Publication date: 2014